A novel difficult-to-segment samples focusing network for oral CBCT image segmentation

Using deep learning to segment oral CBCT images for clinical diagnosis and treatment is an important research direction in clinical dentistry. However, blurred contours and scale differences limit the accuracy of current methods at the crown edges and root regions, making these regions difficult-to-segment samples in the oral CBCT segmentation task. To address these problems, this work proposes a Difficult-to-Segment Focus Network (DSFNet) for segmenting oral CBCT images. The network uses a Feature Capture Module (FCM) to efficiently capture local and long-range features, enhancing feature extraction performance. Additionally, a Multi-scale Feature Fusion Module (MFFM) merges multi-scale feature information. To further raise the loss weight of difficult-to-segment samples, a hybrid loss function combining Focal Loss and Dice Loss is proposed. With the hybrid loss function, DSFNet achieves a 91.85% Dice Similarity Coefficient (DSC) and a 0.216 mm Average Symmetric Surface Distance (ASSD) on the oral CBCT segmentation task. Experimental results show that the proposed method outperforms current dental CBCT image segmentation techniques and has real-world applicability.


Related works
U-Net is a well-known network for medical image segmentation tasks13. Its U-shaped design and skip connections allow effective extraction and fusion of multi-scale features. Consequently, the architecture demonstrates strong segmentation performance and robustness across a variety of medical image segmentation tasks, such as brain14, lung15, and heart16 segmentation. The U-Net structure comprises two main components: an encoder and a decoder. The encoder, a convolutional network with four downsampling modules, reduces the image dimensions while extracting image features. The decoder employs a four-layer deconvolution module for upsampling, restoring the feature map to its original resolution. Furthermore, U-Net uses skip connections to merge each upsampled feature map with the corresponding feature map extracted during downsampling, facilitating the fusion of shallow and deep feature information.
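To make the encoder-decoder pattern concrete, the following is a minimal PyTorch sketch of a U-Net-style network with skip connections. The depth (two levels instead of four) and channel widths are illustrative simplifications, not the original U-Net configuration:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions per stage, as in the original U-Net
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    """Illustrative two-level U-Net: the encoder downsamples, the decoder
    upsamples, and skip connections concatenate encoder features into the
    decoder to fuse shallow and deep information."""
    def __init__(self, in_ch=1, num_classes=1, base=32):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)   # input doubled by skip concat
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))  # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)
```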
U-Net has proven to be a simple and effective model for medical image segmentation. To further enhance its performance, recent medical image segmentation networks build on U-Net's structural characteristics by introducing modifications or additional modules. One example is Attention U-Net, which incorporates attention gates into the original skip connections17, introducing an attention mechanism that lets the network focus on important regions within the feature map. TransUNet18 and Swin-UNet19 integrate the Transformer architecture into medical image segmentation, replacing some or all of the convolution modules and enhancing the network's ability to model long-range dependencies. Two other commonly used segmentation networks, SegNet20 and DeepLab V3+21, share structural similarities with U-Net: both employ encoder-decoder structures for downsampling and upsampling feature maps. SegNet distinguishes itself by recording the index of the retained element in each max pooling, allowing accurate object-boundary recovery during upsampling. DeepLab V3+ uses Atrous Spatial Pyramid Pooling to extract multi-scale features through dilated (atrous) convolutions at different rates.
Each of the networks mentioned above builds on the U-Net architecture. Despite their advantages in various medical image segmentation tasks, these networks still fall short when applied to dental CBCT images. A prominent issue is their inability to accurately segment difficult samples in dental CBCT images, particularly those with small-scale and boundary-blurred characteristics. Further optimization of the network structure, together with improved handling of such difficult samples, is therefore needed to address this problem.

Methodologies

Overall architecture
We propose a novel network, DSFNet, for the segmentation of oral CBCT images, as illustrated in Fig. 1. The network adopts the U-shaped structure with skip connections, featuring three downsampling and three upsampling stages. To enhance the overall performance of the model, we introduce the Feature Capture Module (FCM) within the skip connections. The FCM extracts local and long-range feature representations simultaneously using convolution and Transformer operations. Additionally, we introduce the Multi-scale Feature Fusion Module (MFFM), placed before the segmentation head. The MFFM fuses multi-scale feature information using an attention mechanism, reducing the noise introduced by skip connections and addressing the challenges that scale differences impose on difficult-to-segment samples. Furthermore, to assign higher weight to difficult-to-segment samples, we propose a new loss function that combines Dice Loss and Focal Loss.
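As an illustration of how these components could be wired together, the following PyTorch sketch mirrors the description above. The stage blocks, channel widths, and placeholder defaults are assumptions; `fcm_factory` and `mffm` stand in for the modules detailed in the next subsections:

```python
import torch
import torch.nn as nn

def stage(in_ch, out_ch):
    # One Conv-BN-GELU block stands in for a full encoder/decoder stage here
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.GELU())

class DSFNetWiring(nn.Module):
    """Hypothetical top-level wiring inferred from the text: three downsampling
    and three upsampling stages, an FCM on each skip connection, and an MFFM
    fusing the multi-scale decoder features before the segmentation head."""
    def __init__(self, fcm_factory=lambda c: nn.Identity(), mffm=None,
                 in_ch=1, num_classes=1, base=32):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]
        self.enc = nn.ModuleList([stage(in_ch, chs[0]),
                                  stage(chs[0], chs[1]),
                                  stage(chs[1], chs[2])])
        self.pool = nn.MaxPool2d(2)
        self.bottom = stage(chs[2], chs[3])
        self.fcms = nn.ModuleList([fcm_factory(c) for c in chs[:3]])
        self.up = nn.ModuleList([nn.ConvTranspose2d(chs[3], chs[2], 2, 2),
                                 nn.ConvTranspose2d(chs[2], chs[1], 2, 2),
                                 nn.ConvTranspose2d(chs[1], chs[0], 2, 2)])
        self.dec = nn.ModuleList([stage(chs[2] * 2, chs[2]),
                                  stage(chs[1] * 2, chs[1]),
                                  stage(chs[0] * 2, chs[0])])
        self.mffm = mffm
        self.head = nn.Conv2d(chs[0], num_classes, 1)

    def forward(self, x):
        skips = []
        for enc, fcm in zip(self.enc, self.fcms):
            x = enc(x)
            skips.append(fcm(x))        # FCM refines each skip connection
            x = self.pool(x)
        x = self.bottom(x)
        feats = [x]                     # collect features at all four scales
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
            feats.append(x)
        fused = self.mffm(feats) if self.mffm is not None else feats[-1]
        return self.head(fused)
```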

Feature capture module
To enhance the performance of the network, this work proposes a Feature Capture Module (FCM) that efficiently captures both local and long-range features, improving feature extraction. The structure of this module is illustrated in Fig. 1a.
The proposed Feature Capture Module consists of two parts, a convolution block and a Transformer block, which extract local and global features, respectively. The convolution block adopts a bottleneck structure. First, a convolution with a kernel size of 1 × 1 reduces the channel dimension to one quarter of the original channels. Then, the CBG module (Conv-BN-GELU) extracts features. Finally, another 1 × 1 convolution restores the channel dimension of the feature maps. The operation of the CBG module can be represented as:

CBG(F) = GELU(BN(f_{3×3}(F)))    (1)

where f_{3×3} represents a convolution operation with a kernel size of 3 × 3. The Transformer block is computed as:

z_0 = FE + E_{pos},   F_{trans} = MSA(LN(z_0)) + z_0    (2)

where E is the projection matrix, E_{pos} is the positional encoding, MSA(·) indicates the multi-head self-attention operation, and LN(·) represents layer normalization. Due to the heavy computational cost of self-attention operations, this work applies them to the compressed feature maps in the bottleneck. The output of the FCM, denoted F_{out}, given the input F, is defined as:

F_{out} = f^{1×1}_{c×4}(CBG(f^{1×1}_{c/4}(F)) + F_{trans})    (3)

where f^{1×1}_{c/4} represents a convolution operation with a kernel size of 1 × 1 and output channels one quarter of the input channels, f^{1×1}_{c×4} represents a convolution operation with a kernel size of 1 × 1 and output channels four times the input channels, and F_{trans} is calculated according to Eq. (2).
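A possible PyTorch realization of the FCM following Eqs. (1)-(3) is sketched below. The head count is an assumed hyperparameter, the positional encoding E_{pos} is omitted for brevity, and summing the two branches before channel expansion follows the reconstruction of Eq. (3):

```python
import torch
import torch.nn as nn

class FCM(nn.Module):
    """Sketch of the Feature Capture Module: compress channels to c/4,
    extract local features with a CBG block and long-range features with
    multi-head self-attention on the compressed maps, then restore the
    channel dimension. Assumes c is divisible by 16 (c/4 tokens, 4 heads)."""
    def __init__(self, c, heads=4):
        super().__init__()
        self.reduce = nn.Conv2d(c, c // 4, 1)          # f^{1x1}_{c/4}
        self.cbg = nn.Sequential(nn.Conv2d(c // 4, c // 4, 3, padding=1),
                                 nn.BatchNorm2d(c // 4), nn.GELU())
        self.ln = nn.LayerNorm(c // 4)
        self.msa = nn.MultiheadAttention(c // 4, heads, batch_first=True)
        self.expand = nn.Conv2d(c // 4, c, 1)          # f^{1x1}_{cx4}

    def forward(self, f):
        z = self.reduce(f)
        local = self.cbg(z)                            # local branch, Eq. (1)
        b, ch, h, w = z.shape
        tokens = z.flatten(2).transpose(1, 2)          # (B, H*W, c/4) tokens
        attn, _ = self.msa(self.ln(tokens), self.ln(tokens), self.ln(tokens))
        trans = (attn + tokens).transpose(1, 2).reshape(b, ch, h, w)  # Eq. (2)
        return self.expand(local + trans)              # Eq. (3)
```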

Multi-scale feature fusion module
In order to alleviate the impact of scale differences on segmentation accuracy, a Multi-scale Feature Fusion Module (MFFM) is proposed in this work. It uses attention mechanisms to construct intra-scale attention and inter-scale attention, enhancing the segmentation of challenging samples. The structure of this module is illustrated in Fig. 1b.
The inputs of this module are feature maps of different scales that have been interpolated to the same resolution, denoted F_i, i = 1, 2, 3, 4. The MFFM has two pathways. The first pathway applies the Convolutional Block Attention Module (CBAM) to the feature map of each scale, constructing intra-scale attention that makes the network focus on important regions within the feature maps and yields the intra-scale attention features:

F′_i = σ(MLP(AvgPool_s(F_i)) + MLP(MaxPool_s(F_i))) ⊙ F_i,
F^{intra}_i = σ(f_{7×7}([AvgPool_c(F′_i); MaxPool_c(F′_i)])) ⊙ F′_i    (4)

where AvgPool_c(·) and MaxPool_c(·) denote average pooling and max pooling along the channel dimension, AvgPool_s(·) and MaxPool_s(·) denote the same operations along the spatial dimensions, f_{7×7}(·) denotes a convolution operation with a kernel size of 7 × 7, MLP(·) is a multi-layer perceptron with shared parameters, σ(·) is the sigmoid function, and ⊙ denotes the Hadamard product.
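The intra-scale pathway is standard CBAM; a compact PyTorch sketch follows (the reduction ratio of the shared MLP is an assumed hyperparameter):

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Standard CBAM: channel attention from spatially pooled descriptors,
    then spatial attention from channel-pooled maps (7x7 convolution),
    each applied via Hadamard product as in Eq. (4)."""
    def __init__(self, c, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c, max(c // reduction, 1)),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(max(c // reduction, 1), c))  # shared MLP
        self.conv = nn.Conv2d(2, 1, 7, padding=3)                       # f_{7x7}

    def forward(self, f):
        b, c, h, w = f.shape
        # Channel attention: average/max pooling over the spatial dimensions
        avg = self.mlp(f.mean(dim=(2, 3)))
        mx = self.mlp(f.amax(dim=(2, 3)))
        f = f * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: average/max pooling along the channel dimension
        s = torch.cat([f.mean(dim=1, keepdim=True), f.amax(dim=1, keepdim=True)], dim=1)
        return f * torch.sigmoid(self.conv(s))
```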
In the other pathway, the features of different scales are concatenated along the channel dimension and processed with a convolution operation to construct inter-scale attention, producing an attention map with a channel dimension of 4. Each channel of this attention map is multiplied with the corresponding intra-scale attention features from the first pathway, and the concatenated results are fused by the CBG module to integrate features from different scales:

A = f^{7×7}_{c=4}([F_1; F_2; F_3; F_4]),   F_{out} = CBG([A_1 ⊙ F^{intra}_1; …; A_4 ⊙ F^{intra}_4])    (5)

where f^{7×7}_{c=4} represents a convolution with a kernel size of 7 × 7 and an output feature map with a channel size of 4, and A_i denotes the i-th channel of A. The MFFM effectively integrates features from different scales, further improving model performance.
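Putting both pathways together, here is a hedged sketch of the MFFM. It reuses the CBAM class above; the sigmoid on the inter-scale attention map, the output channel count, and the defensive resizing (the paper interpolates beforehand) are assumptions:

```python
import torch
import torch.nn as nn

class MFFM(nn.Module):
    """Sketch of the Multi-scale Feature Fusion Module: CBAM per scale
    (intra-scale attention) in one pathway, a 4-channel inter-scale
    attention map from a 7x7 convolution in the other, then CBG fusion."""
    def __init__(self, chs, out_ch):
        super().__init__()
        self.cbams = nn.ModuleList([CBAM(c) for c in chs])
        self.inter = nn.Conv2d(sum(chs), 4, 7, padding=3)   # f^{7x7}_{c=4}
        self.cbg = nn.Sequential(nn.Conv2d(sum(chs), out_ch, 3, padding=1),
                                 nn.BatchNorm2d(out_ch), nn.GELU())

    def forward(self, feats):
        # feats: four maps F_i; resize to a common resolution if needed
        size = feats[-1].shape[-2:]
        feats = [nn.functional.interpolate(f, size=size, mode='bilinear',
                                           align_corners=False) for f in feats]
        intra = [cbam(f) for cbam, f in zip(self.cbams, feats)]
        a = torch.sigmoid(self.inter(torch.cat(feats, dim=1)))  # (B, 4, H, W)
        weighted = [a[:, i:i + 1] * intra[i] for i in range(4)]
        return self.cbg(torch.cat(weighted, dim=1))
```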

Mixed loss function
The boundaries of teeth and alveolar bone in oral CBCT images are blurred and difficult to identify, and the proportions of teeth and background in the image vary greatly. The traditional cross-entropy loss function therefore often fails to achieve good segmentation results. To solve these problems, this work proposes a Mixed Loss Function combining Focal Loss and Dice Loss:

L_{mix} = α L_{D′} + (1 − α) L_F    (6)

where L_{D′} denotes the improved Dice Loss, L_F denotes Focal Loss, and α is a scale factor with value range [0, 1]. Focal Loss is calculated as shown in Eq. (7):

L_F = −(1 − y_t)^γ log(y_t)    (7)

where γ is a modulating factor that reduces the relative loss of easy samples, thereby increasing the weight of hard-to-classify samples. y_t is calculated as follows:

y_t = ŷ if y = 1,   y_t = 1 − ŷ otherwise    (8)

where y and ŷ denote the ground truth and the predicted value. Dice Loss is commonly used to address the imbalance of positive and negative samples in segmentation tasks and is calculated as follows:

L_D = 1 − (2Σ y ŷ + ε) / (Σ y + Σ ŷ + ε)    (9)

where ε denotes the smoothing factor. Following the idea of Focal Loss, this study improves Dice Loss to make it focus more on hard-to-segment samples:

L_{D′} = (L_D)^{1+γ′}    (10)

The factor γ′ ∈ [0, 1] acts as a regulator of the improved Dice Loss. When γ′ ∈ (0, 1], the loss of easily segmented samples shrinks, and the ratio between the loss of hard-to-segment and easy-to-segment samples increases, tilting the loss function toward hard-to-segment samples and helping to improve their segmentation accuracy. When γ′ = 0, the loss function degenerates to the original Dice Loss. The effect of the improved Dice Loss is shown schematically in Fig. 2. As an example, for samples with ŷ = 0.8 and ŷ = 0.5, in the case of γ′ = 0 (the original Dice Loss), L_{D,ŷ=0.8} = 0.111 and L_{D,ŷ=0.5} = 0.333, a loss ratio of 1:3. In the case of γ′ = 1, L_{D′,ŷ=0.8} = 0.012 and L_{D′,ŷ=0.5} = 0.111, a loss ratio of about 1:9. The loss share of difficult samples is thus higher with the improved Dice Loss, and the model focuses more on reducing their loss during training.
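Since the exponent form of the improved Dice Loss is pinned down by the worked example (L_D = 0.111 maps to L_{D′} = 0.012 at γ′ = 1), the loss can be sketched directly in PyTorch. The convex-combination form of the mixed loss follows Eq. (6) as reconstructed above and is an assumption:

```python
import torch

def focal_loss(y_hat, y, gamma=2.0, eps=1e-7):
    """Binary Focal Loss, Eqs. (7)-(8): y_hat are foreground probabilities,
    y is the binary ground-truth mask."""
    y_t = torch.where(y == 1, y_hat, 1 - y_hat)
    return (-(1 - y_t) ** gamma * torch.log(y_t + eps)).mean()

def improved_dice_loss(y_hat, y, gamma_prime=1.0, smooth=1e-6):
    """Improved Dice Loss, Eqs. (9)-(10): the plain Dice loss raised to the
    power 1 + gamma'. Reproduces the paper's example: for y = 1, y_hat = 0.8,
    gamma' = 1, the value is 0.111^2 ~= 0.012."""
    inter = (y_hat * y).sum()
    dice = 1 - (2 * inter + smooth) / (y_hat.sum() + y.sum() + smooth)
    return dice ** (1 + gamma_prime)

def mixed_loss(y_hat, y, alpha=0.5, gamma=2.0, gamma_prime=1.0):
    # Assumed combination rule (Eq. 6): convex mix controlled by alpha
    return alpha * improved_dice_loss(y_hat, y, gamma_prime) + \
           (1 - alpha) * focal_loss(y_hat, y, gamma)
```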

Experiments

Dataset and experiment details
The dataset contains CBCT images of 150 patients with corresponding annotations22. For data preprocessing, the dataset was standardized to a uniform input scale. First, the voxel size of all CBCT data was resampled to 0.4 mm, and the 98 CBCT image sets with higher resolution were selected to construct the experimental dataset. To mitigate the impact of metal artifacts, the intensity values of each CBCT scan were clipped to [0, 2500]. The voxel intensities were then normalized to the range [0, 1], following standard practice in deep learning image processing. In addition, the original dataset was built for an instance segmentation task, whereas this work studies a binary classification task separating teeth from background; the label categories were therefore remapped so that 0 represents the background region and 1 represents the teeth region. Finally, the 98 processed datasets were divided into training, validation, and test sets in a 6:2:2 ratio. The processed dataset of this work can be obtained by contacting the corresponding author.
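A minimal sketch of this preprocessing pipeline with NumPy/SciPy; the function names and the linear interpolation order are illustrative choices:

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(volume, spacing, target_spacing=0.4, lo=0.0, hi=2500.0):
    """Resample a 3D CBCT volume to isotropic 0.4 mm voxels, clip intensities
    to [0, 2500] to suppress metal artifacts, and normalize to [0, 1].
    `spacing` gives the (z, y, x) voxel sizes of `volume` in mm."""
    factors = [s / target_spacing for s in spacing]
    volume = zoom(volume.astype(np.float32), factors, order=1)  # linear resample
    volume = np.clip(volume, lo, hi)
    return (volume - lo) / (hi - lo)

def binarize_labels(label):
    # Collapse instance-level tooth labels to binary: background 0, teeth 1
    return (label > 0).astype(np.uint8)
```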
In terms of experimental details, four NVIDIA A100 GPUs were used for training, the input image size was set to 224 × 224, and the batch size was set to 128. The initial learning rate was set to 3e−4, and 50 epochs were trained using the AdamW optimizer with a weight decay of 0.05, while a cosine learning rate scheduler gradually reduced the learning rate to ensure convergence of each model. The model's performance was verified on the validation set after each epoch, and the optimal weight parameters were saved and evaluated on the test set to obtain the final segmentation accuracy.
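A sketch of the described training setup in PyTorch; `model`, `train_loader`, `val_loader`, and `evaluate` are assumed to exist, the single-channel sigmoid output is an assumption, and `mixed_loss` is the function sketched earlier:

```python
import torch

# AdamW (lr 3e-4, weight decay 0.05), 50 epochs, cosine learning-rate decay
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

best_dsc = 0.0
for epoch in range(50):
    model.train()
    for images, labels in train_loader:          # labels: binary float masks
        optimizer.zero_grad()
        probs = torch.sigmoid(model(images))     # single-channel foreground probs
        loss = mixed_loss(probs, labels, alpha=0.5, gamma=2.0, gamma_prime=1.0)
        loss.backward()
        optimizer.step()
    scheduler.step()
    dsc = evaluate(model, val_loader)            # validate after each epoch
    if dsc > best_dsc:                           # keep the best-performing weights
        best_dsc = dsc
        torch.save(model.state_dict(), "best.pt")
```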
Regarding evaluation, this study adopts the Dice Similarity Coefficient (DSC) and the Average Symmetric Surface Distance (ASSD) as metrics for the segmentation results. DSC reflects the segmentation accuracy within the target region, while ASSD reflects the segmentation accuracy at the target boundary.
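For reference, both metrics can be computed as follows (a sketch assuming boolean mask arrays and the 0.4 mm isotropic spacing from preprocessing):

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dsc(pred, gt):
    """Dice Similarity Coefficient: 2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(pred, gt).sum()
    return 2 * inter / (pred.sum() + gt.sum())

def assd(pred, gt, spacing=(0.4, 0.4, 0.4)):
    """Average Symmetric Surface Distance in mm: the mean distance from each
    surface voxel of one mask to the nearest surface voxel of the other."""
    surf_p = pred & ~binary_erosion(pred)        # surface voxels of prediction
    surf_g = gt & ~binary_erosion(gt)            # surface voxels of ground truth
    d_to_g = distance_transform_edt(~surf_g, sampling=spacing)
    d_to_p = distance_transform_edt(~surf_p, sampling=spacing)
    dists = np.concatenate([d_to_g[surf_p], d_to_p[surf_g]])
    return dists.mean()
```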

Comparison experiment
To test the effectiveness of DSFNet, it was first trained on the oral CBCT image dataset and compared with existing medical image segmentation networks, using the traditional Dice Loss as the loss function. The training loss and validation loss during training are shown in Figs. 3 and 4.
As shown in the figures above, all models converged after 50 epochs of training. Under the same training parameters, the proposed DSFNet converges faster and shows a smaller gap between training and validation loss, indicating better generalization performance. The segmentation results of each network are shown in Table 1.
The segmentation results show that the proposed DSFNet achieves a DSC of 91.31% and an ASSD of 0.249 mm, with segmentation accuracy both inside the tooth region and at the boundary exceeding that of the existing models. The visualization of the segmentation results is shown in Fig. 5.
To verify the effectiveness of the proposed Mixed Loss Function, the modulating factor of Focal Loss was set to γ = 2, and DSFNet was trained under Mixed Loss scale factors α ∈ {0, 0.5, 1} and improved Dice Loss modulating factors γ′ ∈ {0, 0.5, 1}. The training results are shown in Table 2. They show that DSFNet achieves the highest accuracy with α = 1 and γ′ = 1, demonstrating the effectiveness of the Mixed Loss Function. The visual segmentation results of DSFNet trained with Dice Loss and with the Mixed Loss Function (α = 0.5, γ′ = 1) are shown in Fig. 6.

Ablation experiment
To verify the importance of each module in DSFNet, a set of ablation studies was conducted. First, the FCM and MFFM were removed from DSFNet, and a baseline was trained using the original Dice Loss function, as shown in Table 3.
Based on the baseline, FCM and MFFM were then added and compared, as shown in rows NO. 1-3, respectively. The experimental results show that both FCM and MFFM effectively improve network performance. In addition, this study further compares the effect of using the Mixed Loss Function, as shown in Table 3.

Figure 2. Schematic diagram of the effect of the improved Dice Loss.

Figure 5. Visualization of the segmentation results using different networks.

Figure 6. Visualization of the segmentation results using different loss functions.

Table 1. Performance of different networks.

Table 2. Performance under different loss function parameters.

Table 3. Results of the ablation experiments.